Research Compendium

Disclaimer

Compendium



“A collection of concise but detailed information about a particular subject”


A comprehensive archive summarizing your project (data/analysis/code).

Research Compendium


  • A few simple rules

  • for you and others

  • so that you can share your code, data, results with your students, supervisors, collaborators and/or the scientific community

The problem “I have my own organisation”

Here enters …
the ‘Research Compendium’

Research compendium

The goal?

The goal of a research compendium is to provide a standard and easily recognisable way for organising the digital materials of a project to enable others to inspect, reproduce, and extend the research.

Three Generic Principles

  1. Organize files according to prevailing conventions: i) this helps other people recognize the structure of the project, ii) it supports building tools that take advantage of the shared structure.
  2. Separate data, method, and output, while making the relationship between them clear.
  3. Specify the computational environment that was used for the original analysis.

1. Organisation (projects)

1 project = 1 folder = 1 compendium

e.g. with RStudio: use RStudio projects


    │
    ├── [my_project]
    │   └── my_project.Rproj
    │
    ├── [another_project]
    │   └── another_project.Rproj
    │
    ├── [again_another_project]
    │   └── again_another_project.Rproj
    │

1. Organisation: e.g. R projects

Stop using setwd()!

Absolute paths (e.g. C:\\Albert\Bureau\PhD) only work on your computer (and not on others).

Use relative paths defined from the root of the project: e.g. outputs/01_datacleaned.csv, data/data_raw.csv

1. Organisation: e.g. R projects


Use the package {here}

Use it in your reports (Quarto, R Markdown) so that the code builds the absolute paths itself:

# bad!
# setwd("good/luck/where/my_project/is/data")
# dat <- read.csv("data_raw.csv")

# better: the path is relative to the project root
dat_path <- file.path("data", "data_raw.csv")
dat <- read.csv(here::here(dat_path))

2. Separation of data, method, and output

  • Data files, code files and output files are separated.

  • This separation is materialized by folders.

    .
    ├── my_project.Rproj
    ├── [data]
    └── [outputs]

Implications

Keeping data and method separate treats the data as “read-only”, so that the original data is untouched and all modifications are transparently documented in the code.

The output files should be considered as disposable, with a mindset that one can always easily regenerate the output using the code and data.
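
As a minimal sketch of these two implications, the snippet below makes a raw data file read-only and rebuilds the outputs folder from scratch. The folder names follow the layout above; the temporary project root, the column names and the log transformation are assumptions made up for the example.

```r
# Minimal sketch: raw data are read-only, outputs are disposable.
# The project root is a temporary folder so the example runs anywhere.
root <- file.path(tempdir(), "my_project")
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)

# Hypothetical raw data file (columns invented for the example)
raw_file <- file.path(root, "data", "data_raw.csv")
if (!file.exists(raw_file)) {
  write.csv(data.frame(length = c(10, 12), weight = c(5, 8)), raw_file,
            row.names = FALSE)
}

# Remove write permission on the raw data (limited effect on Windows)
Sys.chmod(raw_file, mode = "0444")

# Outputs are disposable: wipe the folder and regenerate it from the code
out_dir <- file.path(root, "outputs")
unlink(out_dir, recursive = TRUE)
dir.create(out_dir)

dat <- read.csv(raw_file)            # the raw file is only ever read
dat$log_length <- log(dat$length)    # every modification lives in the code
write.csv(dat, file.path(out_dir, "01_datacleaned.csv"), row.names = FALSE)
```

Deleting the outputs folder before each run is a simple way to check that nothing in it was edited by hand.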

2. Separation of data, method, and output


- The analysis flow (the methods) is split into reusable pieces (functions), which are called by analysis scripts:

    .
    ├── my_project.Rproj
    ├── [data] (raw data)
    ├── [R] (functions = small pieces of reusable code)
    ├── [analyses] (scripts)
    └── [outputs] (results)
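
This folder skeleton can be created from R, for instance with the sketch below. The temporary `root` is only there so the example runs anywhere; in practice it would be the root of your project.

```r
# Minimal sketch: create the four folders of the compendium.
# `root` is a temporary location chosen for the example.
root <- file.path(tempdir(), "my_project")
folders <- c("data", "R", "analyses", "outputs")

for (folder in folders) {
  dir.create(file.path(root, folder), recursive = TRUE, showWarnings = FALSE)
}

# Check what was created
list.dirs(root, full.names = FALSE, recursive = FALSE)
```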

Careful in R

The folder R should only contain .R files that contain function definitions. Any top-level call in the folder R will be executed when calling devtools::load_all() or targets::tar_source().
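
A file in the R folder could look like the sketch below. The function `fit_length_weight()` and the log-log model are hypothetical, chosen only to match the length-weight theme of the example scripts.

```r
# R/functions.R (sketch): function definitions only, no top-level calls.
# `fit_length_weight()` is a hypothetical helper invented for this example.

# Fit the relationship weight = a * length^b on the log scale.
fit_length_weight <- function(dat) {
  stats::lm(log(weight) ~ log(length), data = dat)
}

# Avoid lines like the following in the R folder:
# they would run every time the functions are loaded.
# dat <- read.csv("data/data_raw.csv")
# fit_length_weight(dat)
```

The calls themselves belong in a script of the analyses folder, which reads the data and then uses the function.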

2. Separation of data, method, and output

  • The analysis flow is split into small steps
  • Those steps (scripts) are numbered and documented
.
├── my_project.Rproj
├── [data]
├── [R]
├── [analyses]
│   ├── 00_setup.R (load packages, global variables)
│   ├── 01_data.R (read and format data)
│   ├── 02_length-weight.R (first analysis)
│   ├── 03_plot-length-weight.R (generate first plot)
│   ├── ...
└── [outputs]

############################################################
#
# 00_setup.R: load packages, set global variables
#
############################################################
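
Below such a header, the body of a setup script might look like the following sketch. The package list and the variable names are assumptions for illustration; base packages are used so that the code runs as is.

```r
# Sketch of the body of analyses/00_setup.R.

# Packages used by the analyses (replace with your own list)
pkgs <- c("stats", "utils")
for (pkg in pkgs) {
  library(pkg, character.only = TRUE)
}

# Global variables shared by the following scripts (hypothetical names)
dir_data    <- "data"
dir_outputs <- "outputs"

# Fix the random seed so that any random step is repeatable
set.seed(42)
```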

2. Separation of data, method, and output

  • The relationships must be specified as well: which code operates on which data, in which order, to produce which outputs.

  • Use a main script (make.R) which executes the different steps in the right order (it’s the only R script at the root of the folder!)

.
├── my_project.Rproj
├── [data]
├── [R]
├── [analyses]
├── [outputs]
└── make.R

############################################################
#
# make.R: build the project
#
############################################################

source("analyses/00_setup.R")

source("analyses/01_data.R")

source("analyses/02_length-weight.R")

source("analyses/03_plot-length-weight.R")
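
An alternative to listing each script by hand is to let make.R discover the numbered scripts, as in this sketch. It relies on the zero-padded prefixes (00_, 01_, ...) for ordering; the two throwaway scripts and the `steps` vector exist only to make the example self-contained.

```r
# Alternative sketch for make.R: run every numbered script in order.
# The demo builds two throwaway scripts in a temporary folder.
analyses_dir <- file.path(tempdir(), "analyses")
dir.create(analyses_dir, showWarnings = FALSE)
writeLines("steps <- c(steps, 'setup')", file.path(analyses_dir, "00_setup.R"))
writeLines("steps <- c(steps, 'data')",  file.path(analyses_dir, "01_data.R"))

steps <- character(0)  # records which scripts ran, for the demo only

# list.files() returns names in alphabetical order, so zero-padded
# prefixes give the intended execution order
scripts <- list.files(analyses_dir, pattern = "^[0-9]{2}_.*\\.R$",
                      full.names = TRUE)
for (script in scripts) {
  message("Running ", basename(script))
  source(script, local = TRUE)
}
```

Listing the scripts explicitly, as above, remains easier to read; the automatic version mainly helps when there are many steps.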

2. Separation of data, method, and output

  • Each script (step) writes its results to files that it refers to explicitly by name.
  • The script “analyses/01_data.R” writes results of the type “outputs/01_length-weight_females.RData”
  • The script “analyses/03_plot-length-weight.R” writes results of the type “figures/03_length-weight_males.png”
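
The end of a script such as analyses/01_data.R could be sketched as follows. The object `lw_females` and its content are made up for the example, and a temporary folder stands in for the outputs folder of the project.

```r
# Sketch of how a script writes an explicitly named result.
out_dir <- file.path(tempdir(), "outputs")  # "outputs" in a real project
dir.create(out_dir, showWarnings = FALSE)

# Hypothetical result object
lw_females <- data.frame(length = c(10, 12), weight = c(5, 8))

# The file name carries the number of the script that produced it
out_file <- file.path(out_dir, "01_length-weight_females.RData")
save(lw_females, file = out_file)

# A later script reads the result back by the same explicit name
load(out_file)
```

Keeping the script number in the file name makes it easy to trace each output back to the step that created it.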

2. Separation of data, method, and output


A simple layout:

    .
    ├── my_project.Rproj
    ├── [data]
    ├── [R]
    ├── [analyses]
    └── [outputs]

A more detailed layout:

    .
    ├── my_project.Rproj
    ├── [data]
    │   ├── [raw_data]
    │   └── [derived_data]
    ├── [R]
    ├── [analyses]
    ├── [figures]
    └── [outputs]

Flexibility

Depending on your project, the corresponding organisation might be more or less complex. Adapt the compendium to your needs.

Limits of the approach

2. Separation of data, method, and output

  • Separate documents such as papers and presentations
.
├── DESCRIPTION
├── [data]
├── [R]
├── [analyses]
├── [outputs]
├── [syntheses]
│   ├── paper.qmd
│   └── presentation.qmd
├── my_project.Rproj
├── README.md
├── README.qmd
├── renv.lock
└── make.R

2. Separation of data, method, and output

  • Separate documents such as papers and presentations

  • Add useful resources (biblio, etc …)

.
├── DESCRIPTION
├── [data]
├── [R]
├── [analyses]
├── [outputs]
├── [syntheses]
├── [documents]
├── my_project.Rproj
├── README.md
├── README.qmd
├── renv.lock
└── make.R

3. Specify the computational environment

  • Specify the computational environment that was used for the original analysis.

At its most basic, this could be a plain text file that includes a short list of the names and version numbers of the software and other critical tools used for the analysis. In more complex approaches, described below, the computational environment can be automatically preserved or reproduced as well.
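
The plain text option can be produced from R itself, as in this minimal sketch. The file name `session_info.txt` and the package list are arbitrary choices for the example.

```r
# Minimal sketch: record R and package versions in a plain text file.
# A temporary folder is used here; the project root would be used in practice.
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(utils::capture.output(utils::sessionInfo()), info_file)

# Versions of specific packages can be appended as well
pkgs <- c("stats", "utils")  # replace with the packages your analysis uses
versions <- vapply(pkgs, function(p) as.character(utils::packageVersion(p)),
                   character(1))
cat(paste(pkgs, versions), file = info_file, sep = "\n", append = TRUE)
```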

3. Specify the computational environment

  • Specify the computational environment that was used for the original analysis.

Place a README file at the root of the project.

e.g. write an Rmd or qmd file, and render it in make.R.

############################################################
#
# make.R: build the project
#
############################################################

# [...]

source("analyses/03_plot-length-weight.R")

# Render the README from its Quarto source
quarto::quarto_render("README.qmd", output_file = "README.md")

  .
  ├── my_project.Rproj
  ├── [data]
  ├── [R]
  ├── [analyses]
  ├── [outputs]
  ├── README.md
  ├── README.qmd
  └── make.R

3. Specify the computational environment

  • Specify the computational environment that was used for the original analysis.

  • Use a DESCRIPTION file and the package renv for the packages!

.
├── DESCRIPTION
├── [data]
├── [R]
├── [analyses]
├── [outputs]
├── my_project.Rproj
├── README.md
├── README.qmd
├── renv.lock
└── make.R

DESCRIPTION

Package: fish_length-weight_run
Type: Package
Title: Fish Length-Weight in la Réunion
Version: 0.0.0.9000

Imports: 
    ggplot2,
    qs

make.R

#renv::init()
renv::install()
renv::snapshot()
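
On the receiving side, a collaborator can rebuild the recorded package library from renv.lock. The wrapper below is a hedged sketch: `restore_environment()` is a hypothetical helper, and it only calls renv::restore() when both renv and a lockfile are available.

```r
# Sketch: restore the packages recorded in renv.lock, if possible.
# `restore_environment()` is a hypothetical helper, not part of renv.
restore_environment <- function(project = ".") {
  lockfile <- file.path(project, "renv.lock")
  if (!requireNamespace("renv", quietly = TRUE) || !file.exists(lockfile)) {
    message("renv or renv.lock not found for project: ", project)
    return(invisible(FALSE))
  }
  # Install the package versions listed in the lockfile, without prompting
  renv::restore(project = project, prompt = FALSE)
  invisible(TRUE)
}
```

Calling `restore_environment()` from the project root would then be the first step before running make.R on a new machine.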

4. Sharing / storage

  • There are a number of online options to store your project.

  • Many are private (e.g. Dryad, https://datadryad.org/)

  • Zenodo (https://zenodo.org/) was created by OpenAIRE and CERN in 2013 and allows uploads of up to 50 GB.

Research Compendium

    .
    ├── [data]
    │   └── raw-data.csv
    ├── [R]
    │   └── functions.R
    ├── [analyses]
    │   └── pipeline.R
    ├── [outputs]
    ├── [syntheses]
    │   └── paper.qmd
    ├── [documents]
    ├── my_project.Rproj
    ├── README.md
    ├── DESCRIPTION
    ├── Dockerfile
    ├── renv.lock
    └── make.R

Research Compendium

    .
    ├── [data]
    │   └── raw-data.csv (raw data)
    ├── [R]
    │   └── functions.R (functions)
    ├── [analyses]
    │   └── pipeline.R (workflow)
    ├── [outputs] (results)
    ├── [syntheses]
    │   └── paper.qmd (article, supp. mat, presentation)
    ├── [documents] (biblio)
    ├── my_project.Rproj (project)
    ├── README.md (help)
    ├── DESCRIPTION (dependencies, packages)
    └── make.R (setup, workflow)

Research compendium

    .
    ├── [data]
    │   └── raw-data.csv
    ├── [R]
    │   └── functions.R
    ├── [analyses]
    │   └── pipeline.R
    ├── [outputs]
    ├── [syntheses]
    │   └── paper.qmd
    ├── [documents]
    ├── my_project.Rproj
    ├── README.md
    ├── DESCRIPTION
    └── make.R







rcompendium